SDM with R
1st QCBS R Symposium
Outline
This tutorial illustrates how to build, evaluate, and project species distribution models within R. The main steps we will go through, described below, are the following:
- R environment preparation
- Data preparation
- Study area and manipulating spatial information
- Environmental data
- Species occurrence data
- Presence-and-absence data
- Presence-only and pseudo-absence data
- Model fitting, prediction, and evaluation
- Making projections
- Multi-species modelling
- DIY: Create an SDM yourself!
- Bonus
R environment preparation
Download the .zip file from here and extract files to your favorite QCBS R Symposium workshop folder.
To run this workshop, please use the latest version of R. A number of packages are also required and will be downloaded. Install and load all required R libraries:
install.packages("rgdal")
install.packages("raster")
install.packages("rgeos")
install.packages("dismo")
install.packages("letsR")
install.packages("biomod2")
install.packages("biogeo")
install.packages("maptools")
install.packages("stringr")
install.packages("ggplot2")
install.packages("plotly")
install.packages("xtable")
library(rgdal)
library(raster)
library(rgeos)
library(dismo)
library(letsR)
library(biomod2)
library(biogeo)
library(maptools)
library(stringr)
library(ggplot2)
library(plotly)
library(xtable)
Set your working directory to the directory containing the workshopSDM folder using the function setwd() or your preferred method.
Finally, we recommend creating and saving a new working environment for this tutorial.
# save all the working space
save.image("QCBS_SDMTutorial.RData")
# free the working space
rm(list = ls())
# and get it back
load("QCBS_SDMTutorial.RData")
Data preparation
Selecting your study area and manipulating spatial files
Most species distribution models are done using spatial grid information. In the next step, we will learn how to create a polygon grid of a given region, which will be used within our species distribution model framework.
For this tutorial, we have chosen the Neotropical zoogeographical region as our study area. Load the polygon shapefile of it using the readOGR function:
neotropical_shape <- rgdal::readOGR("data/shapefiles/neotropical.shp")
OGR data source with driver: ESRI Shapefile
Source: "data/shapefiles/neotropical.shp", layer: "neotropical"
with 1 features
It has 3 fields
plot(neotropical_shape)
Now that we have our study region, we need to create a grid and intersect it with our shapefile. For this, we will use a modified version of the GridFilter function (originally by J. Hidasi-Neto). It takes a polygon shapefile, a resolution, and the minimum proportion of overlap required for a cell to be kept.
We will apply it to the neotropical_shape polygon you have loaded, intersecting and cropping its features.
GridFilter <- function(shape, resol = 1, prop = 0) {
grid <- raster(extent(shape))
res(grid) <- resol
proj4string(grid) <- proj4string(shape)
gridpolygon <- rasterToPolygons(grid)
drylandproj <- spTransform(shape, CRS("+proj=laea"))
gridpolproj <- spTransform(gridpolygon, CRS("+proj=laea"))
gridpolproj$layer <- c(1:length(gridpolproj$layer))
areagrid <- gArea(gridpolproj, byid = T)
gridpolproj <- gBuffer(gridpolproj, byid = TRUE, width = 0)
dry.grid <- intersect(drylandproj, gridpolproj)
areadrygrid <- gArea(dry.grid, byid = T)
info <- cbind(dry.grid$layer, areagrid[dry.grid$layer], areadrygrid)
dry.grid$layer <- info[, 3]/info[, 2]
dry.grid <- spTransform(dry.grid, CRS(proj4string(shape)))
dry.grid.filtered <- dry.grid[dry.grid$layer >= prop, ]
}
# Create a spatial polygon grid for the Neotropical region, with 5 degrees x
# 5 degrees
neotropical_grid <- GridFilter(neotropical_shape, resol = 5, prop = 0.5)
neotropical_grid$ID <- 1:(length(neotropical_grid))
# Export your resulting polygon grid
writeOGR(neotropical_grid, dsn = paste(getwd(), "/data/shapefiles", sep = ""),
    layer = "neotropical_grid_5", driver = "ESRI Shapefile", overwrite_layer = T)
This is our resulting polygon grid, where each cell refers to a site (and a row) in our data.
neotropical_grid <- readOGR("data/shapefiles/neotropical_grid_5.shp")
OGR data source with driver: ESRI Shapefile
Source: "data/shapefiles/neotropical_grid_5.shp", layer: "neotropical_grid_5"
with 63 features
It has 5 fields
plot(neotropical_grid)
We also need the coordinates of our cell centroids to proceed with the data analysis.
# Extract coordinates from the cell centroids
myRespCoord <- as.data.frame(coordinates(neotropical_grid))
colnames(myRespCoord) <- c("Longitude_X", "Latitude_Y")
# Export it for the future
write.table(myRespCoord, "data/matrices/NT_coords_5.txt")
Importing environmental variables
In species distribution modeling, predictor variables are typically extracted from raster (grid) files. Each raster should represent a given variable of interest, which can be climatic, soil, terrain, vegetation, land use, or other types of variables.
The getData function from the dismo package is able to retrieve useful climate, elevation and administrative data.
bioClimVars <- getData("worldclim", var = "bio", res = 10)
bioClimVars
You can also load your predictor variables from any directory on your computer using the raster (single file) and stack (multiple files) functions:
rasterFiles <- list.files("data/rasters/bio_10m_bil", pattern = "bil$", full.names = TRUE)
rasterFiles
 [1] "data/rasters/bio_10m_bil/bio1.bil"
[2] "data/rasters/bio_10m_bil/bio10.bil"
[3] "data/rasters/bio_10m_bil/bio11.bil"
[4] "data/rasters/bio_10m_bil/bio12.bil"
[5] "data/rasters/bio_10m_bil/bio13.bil"
[6] "data/rasters/bio_10m_bil/bio14.bil"
[7] "data/rasters/bio_10m_bil/bio15.bil"
[8] "data/rasters/bio_10m_bil/bio16.bil"
[9] "data/rasters/bio_10m_bil/bio17.bil"
[10] "data/rasters/bio_10m_bil/bio18.bil"
[11] "data/rasters/bio_10m_bil/bio19.bil"
[12] "data/rasters/bio_10m_bil/bio2.bil"
[13] "data/rasters/bio_10m_bil/bio3.bil"
[14] "data/rasters/bio_10m_bil/bio4.bil"
[15] "data/rasters/bio_10m_bil/bio5.bil"
[16] "data/rasters/bio_10m_bil/bio6.bil"
[17] "data/rasters/bio_10m_bil/bio7.bil"
[18] "data/rasters/bio_10m_bil/bio8.bil"
[19] "data/rasters/bio_10m_bil/bio9.bil"
bioClimVars <- stack(rasterFiles)
bioClimVars
class       : RasterStack
dimensions : 900, 2160, 1944000, 19 (nrow, ncol, ncell, nlayers)
resolution : 0.1666667, 0.1666667 (x, y)
extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
names : bio1, bio10, bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19, bio2, bio3, bio4, bio5, ...
min values : -269, -97, -488, 0, 0, 0, 0, 0, 0, 0, 0, 9, 8, 72, -59, ...
max values : 314, 380, 289, 9916, 2088, 652, 261, 5043, 2159, 4001, 3985, 211, 95, 22673, 489, ...
You can also plot your variables to take a look at them:
plot(bioClimVars/10) # WorldClim temperature data are stored multiplied by 10. Let's take a look at the real values.
Variable extraction
Before extracting variables, it is really important that your rasters have the same projection (i.e. type of coordinate reference system) as your other spatial files. See ?CRS for help in finding CRS definitions.
crs.geo <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84") # geographical, datum WGS84
proj4string(neotropical_grid) <- crs.geo # define projection system of grid data
proj4string(bioClimVars) <- crs.geo # and of our bioclimatic rasters
Now, we can extract the predictor values for each cell of our grid!
myExpl <- data.frame(bio_1 = numeric(length(neotropical_grid)), bio_2 = numeric(length(neotropical_grid)),
bio_3 = numeric(length(neotropical_grid)), bio_5 = numeric(length(neotropical_grid)))
bio_1_ext <- extract(bioClimVars$bio1, neotropical_grid) # bio_1_ext is a list of cells that contains all values for each cell, for each predictor
bio_2_ext <- extract(bioClimVars$bio2, neotropical_grid)
bio_3_ext <- extract(bioClimVars$bio3, neotropical_grid)
bio_5_ext <- extract(bioClimVars$bio5, neotropical_grid)
myExpl$bio_1 <- unlist(lapply(bio_1_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
myExpl$bio_2 <- unlist(lapply(bio_2_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
myExpl$bio_3 <- unlist(lapply(bio_3_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
myExpl$bio_5 <- unlist(lapply(bio_5_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
# Let's take a look at our table:
head(myExpl)
     bio_1     bio_2    bio_3     bio_5
1 215.9267 138.49084 68.00000 1631.1465
2 236.4301 114.60676 67.33333 1570.8894
3 241.8685 99.48434 65.41545 1568.2213
4 239.1822 91.49031 85.48450 355.5523
5 235.9922 99.52013 80.05817 535.7483
6 269.0044 101.91778 74.06000 727.2778
Save your results and proceed to the next step.
write.table(myExpl, "data/matrices/NT_EnvVar_5.txt")
Importing species data
Using expert drawn maps
One way of acquiring species distribution information is from expert-drawn maps, such as those in the IUCN Red List of Threatened Species database. These maps are usually drawn using multiple methods: some are based only on occurrence data (e.g. a polygon around known occurrences), whereas others incorporate multiple degrees of knowledge, such as the habitat requirements or elevational limits of the species, in essence using an informal distribution modelling approach.
Richness maps derived from expert-drawn range maps may overestimate local richness ('errors of commission') relative to point-to-grid richness maps. This is because they are generally drawn to include all areas where a species is known to occur, without excluding areas in between where the species may not occur. They tend to map the 'extent of occurrence' of a species, which includes the, perhaps much smaller, 'area of occupancy' (Loiselle et al., 2003; Habib et al., 2004; Hurlbert & White, 2005; Graham & Hijmans, 2006). We advise caution when using them.
speciesGeoDist <- readShapePoly("data/shapefiles/NT_TERRESTRIAL_MAMMALS_subset.shp",
proj4string = crs.geo)
plot(speciesGeoDist)
Create a presence-absence matrix of species' geographic ranges within your polygon grid shapefile. For this, we will use the function lets.presab.grid from the letsR package.
abspres.matrix <- lets.presab.grid(shapes = speciesGeoDist, grid = neotropical_grid,
    sample.unit = "ID", remove.sp = TRUE, presence = NULL, origin = NULL, seasonal = NULL)
You can now visualize a map of your species richness within your polygon grid:
richnessCalc <- rowSums(abspres.matrix$PAM) + 1
colPalette <- colorRampPalette(c("#fff5f0", "#fb6a4a", "#67000d"))
colors <- c("white", colPalette(max(richnessCalc)))
plot(abspres.matrix$grid, border = "gray40", col = colors[richnessCalc])
map(add = TRUE)
Export your presence-absence matrix:
write.table(abspres.matrix$PAM, "data/matrices/NT_PAM_5.txt")
Using occurrence records
We can also use occurrence records available online, or collected in the field. Here, we will use information from the Global Biodiversity Information Facility (GBIF) and import it using the function gbif:
D_youngi.GBIF <- gbif("Diaemus", "youngi")
# get data frame with spatial coordinates (points)
Dyoungi.Occ <- subset(D_youngi.GBIF, select = c("country", "lat", "lon"))
head(Dyoungi.Occ) # a simple data frame with coordinates
              country      lat       lon
1 Nicaragua 11.11199 -85.76016
2 Trinidad and Tobago 10.54603 -61.48618
3 Suriname 1.99449 -56.09211
4 Suriname 1.99000 -56.09000
5 Suriname 1.99400 -56.09200
6 Suriname NA NA
Issues with occurrence points
Occurrence data from museum and herbarium collections are very valuable for mapping biodiversity patterns and for species distribution modelling. But these datasets contain many errors and are susceptible to numerous data-quality issues. Many of these errors are extensively described in "Biogeo: an R package for assessing and improving data quality of occurrence record datasets" (Robertson et al., 2016).
Let’s take a look at a few of them:
| Error | Probable.Reason |
|---|---|
| Points plotted at zero degrees latitude and longitude. | No coordinates were available in the original dataset but values of zero assigned to the coordinates |
| Points in sea for terrestrial species or on land for aquatic species, obvious geographical outliers. | Transposed latitude and longitude coordinates; incorrect sign on the decimal degrees of the latitude or longitude coordinate; degrees and minutes were transposed before the coordinate was converted to decimal degrees; imprecise locality description used to assign coordinates; the specimen was incorrectly identified by the collector or the incorrect name was applied to the species when the data were digitized. |
| Point in sea but close to coast for terrestrial species, or on land but close to coast for marine species. | Low precision coordinates e.g. only degrees were available or the data were originally collected on a coarse-scale grid. Imprecise locality description used to assign coordinates. |
| Point plotted along the prime meridian or equator. | Missing coordinate for latitude or for longitude that was incorrectly assigned a value of zero. |
| Country name given in the record does not correspond with the country where the point is plotted. | Likely the same errors as for the second row above. |
| Elevation given in the record does not correspond with the elevation obtained from a digital elevation model where the point is plotted. | Likely the same errors as for the second row above, or the spatial resolution of the digital elevation model is too coarse. |
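Some of these issues can be screened with a few lines of base R. Below is a minimal sketch, assuming the Dyoungi.Occ data frame created above; the steps shown are illustrative, not exhaustive (the biogeo package provides far more thorough checks):

```r
# Illustrative coordinate screening (not a substitute for the biogeo checks)
occClean <- Dyoungi.Occ[!is.na(Dyoungi.Occ$lat) & !is.na(Dyoungi.Occ$lon), ]  # drop missing coordinates
occClean <- occClean[!(occClean$lat == 0 & occClean$lon == 0), ]              # drop points at (0, 0)
occClean <- occClean[abs(occClean$lat) <= 90 & abs(occClean$lon) <= 180, ]    # drop impossible values
occClean <- occClean[!duplicated(occClean[, c("lat", "lon")]), ]              # drop duplicated localities
nrow(Dyoungi.Occ); nrow(occClean)                                             # how many records were removed?
```

Geographic outliers and land/sea mismatches require comparing the points against spatial layers (e.g. country polygons), which is exactly what the biogeo functions are designed for.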
Data cleaning
Let’s take a look at our species matrix. What is going on when we look at the column sums (i.e. the number of presences per species)?
# Load our species data
DataSpecies <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)
colSums(DataSpecies)
          Panthera.onca           Pecari.tajacu
50 55
Peromyscus.leucopus Phyllostomus.latifolius
3 10
Philander.opossum Phyllomys.nigrispinus
39 3
Peromyscus.mexicanus Phyllomys.dasythrix
3 2
Oxymycterus.amazonicus Oxymycterus.hispidus
8 5
Peropteryx.macrotis Pattonomys.carrikeri
47 1
Oxymycterus.josei Peropteryx.pallidoptera
1 5
Oxymycterus.hiska Peromyscus.aztecus
3 3
Peromyscus.beatae Philander.deltae
3 1
Phylloderma.stenops Peromyscus.furvus
43 1
Oxymycterus.rufus Peromyscus.grandis
7 2
Phaenomys.ferrugineus Peropteryx.kappleri
1 44
Phyllomys.blainvillii Phyllomys.brasiliensis
3 2
Peromyscus.levipes Philander.mondolfii
1 5
Philander.frenatus Phyllomys.mantiqueirensis
7 1
Phyllostomus.elongatus Oxymycterus.nasutus
41 5
Peromyscus.hylocetes Oxymycterus.hucucha
1 2
Oxymycterus.roberti Oxymycterus.dasytrichus
7 3
Oxymycterus.angularis Phyllomys.medius
3 3
Peromyscus.megalops Pattonomys.occasius
2 3
Phyllostomus.hastatus Phyllomys.kerri
44 1
Oxymycterus.quaestor Peropteryx.leucoptera
9 28
Phyllomys.thomasi Peromyscus.yucatanicus
1 2
Ozotoceros.bezoarticus Peromyscus.gratus
21 1
Philander.andersoni Pearsonomys.annectens
12 2
Oxymycterus.delator Philander.mcilhennyi
5 4
Peromyscus.maniculatus Peromyscus.gymnotis
1 1
Perognathus.flavus Oxymycterus.paramensis
1 7
Oxymycterus.inca Phyllostomus.discolor
5 46
Peromyscus.difficilis Peromyscus.melanotis
1 1
Phyllomys.lamarum Pattonomys.semivillosus
3 1
Phyllomys.lundi Peromyscus.stirtoni
1 2
Phyllomys.pattoni Phyllomys.sulinus
4 3
Peromyscus.melanophrys Oxymycterus.caparoae
2 2
We can address this by removing species that occur in four or fewer sites.
# Remove all columns that have less than 4 presences
to.remove <- colnames(DataSpecies)[colSums(DataSpecies) <= 4]
`%ni%` <- Negate(`%in%`)
DataSpecies <- subset(DataSpecies, select = names(DataSpecies) %ni% to.remove)
# Replace '_' with '.'
names(DataSpecies) <- gsub(x = names(DataSpecies), pattern = "_", replacement = ".")
DataSpecies <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)
Species distribution modelling
Initialising and formatting direct input data
First, input data needs to be rearranged for use in biomod2 with BIOMOD_FormatingData(). Its most important arguments are:
- resp.var contains the species data (for a single species) in binary format (ones for presences, zeros for true absences, and NA for undetermined) that will be used to build the species distribution models. It may be a vector or a spatial points object.
- expl.var contains a matrix, data.frame, SpatialPointsDataFrame or RasterStack of your explanatory variables.
- resp.xy is a two-column matrix containing the X and Y coordinates of resp.var (only considered if resp.var is a vector).
- resp.name contains the response variable (species) name.
# Recall your variables again:
## Load prepared species data
myResp <- read.table("data/matrices/NT_PAM_5.txt", header=TRUE)
## Load environmental variables
myExpl <- read.table("data/matrices/NT_EnvVar_5.txt", header=TRUE)
# Define the species of interest
myRespName <- names(DataSpecies)[1]
# Load coordinates
myRespCoord <- read.table("data/matrices/NT_coords_5.txt", header=TRUE)
rownames(myRespCoord) <- rownames(myResp)
myBiomodData <- BIOMOD_FormatingData(resp.var = as.numeric(DataSpecies[,myRespName]),
expl.var = myExpl,
resp.xy = myRespCoord,
resp.name = myRespName,
#PA.nb.rep = 2,
#PA.nb.absences = 200,
#PA.strategy = 'random',
na.rm = TRUE)
-=-=-=-=-=-=-=-=-=-=-= Panthera.onca Data Formating -=-=-=-=-=-=-=-=-=-=-=
> No pseudo absences selection !
! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Now, you may check that your data are correctly formatted by printing and plotting the myBiomodData object.
myBiomodData
-=-=-=-=-=-=-=-=-=-=-=-=-= 'BIOMOD.formated.data' -=-=-=-=-=-=-=-=-=-=-=-=-=
sp.name = Panthera.onca
50 presences, 13 true absences and 0 undifined points in dataset
2 explanatory variables
bio_1 bio_2
Min. : 54.06 Min. : 80.49
1st Qu.:182.15 1st Qu.: 99.89
Median :236.43 Median :117.75
Mean :211.69 Mean :115.79
3rd Qu.:256.98 3rd Qu.:128.26
Max. :269.00 Max. :162.52
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
plot(myBiomodData)
Building models
Defining algorithm parameters using default options
We are ready for running our set of models on our species. This step is the core of the modeling procedure.
First, we define the different options for each modelling technique using the function BIOMOD_ModelingOptions(). If we do not change its parameters, it will take the default values (run Print_Default_ModelingOptions() to see them) defined within biomod2. For the sake of simplicity, we will keep all options at their defaults.
myBiomodOption <- BIOMOD_ModelingOptions()
# The created object is then given to BIOMOD_Modeling in the next step.
Computing models
Now, we proceed to running the set of models on our species. We may choose between 10 different algorithms (‘GLM’, ‘GBM’, ‘GAM’, ‘CTA’, ‘ANN’, ‘SRE’, ‘FDA’, ‘MARS’, ‘RF’, ‘MAXENT’).
As we do not have independent evaluation data, we will perform a 3-fold cross-validation (the number of runs is controlled by the NbRunEval argument), randomly splitting our data set into calibration and validation subsets according to the DataSplit argument.
Take a look below:
myBiomodModelOut <- BIOMOD_Modeling(
# BIOMOD_FormatingData()-class object
data = myBiomodData,
# The set of models to be calibrated on the data
models = c('GBM', 'GLM', 'MAXENT.Tsuruoka', 'RF'),
# BIOMOD.models.options object
models.options = myBiomodOption,
# Number of evaluation runs
NbRunEval=3,
# % of data used to calibrate the models, the remaining part will be used for testing
DataSplit=80,
# If this argument is kept to NULL, each observation (presence or absence) has the same weight (independent of the number of presences and absences).
Prevalence=0.5,
# Number of permutations to estimate variable importance
VarImport=3,
# Types of evaluations methods to be used. The complete list is within ?BIOMOD_Modeling()
models.eval.meth = c('TSS','ROC'),
# Keep all results and outputs on hard drive or not. Keep it always TRUE
SaveObj = TRUE,
# If true, all model predictions will be scaled with a binomial GLM
rescal.all.models = FALSE,
# If true, models calibrated and evaluated with the whole dataset are done.
# Building models with all information available may be usefull in some particular
# cases (i.e. rare species with few presences points).
do.full.models = FALSE,
# character, the ID (=name) of modeling procedure. A random number by default
    modeling.id = paste(myRespName,"CurrentClim",sep="")) # We call it 'CurrentClim' because the bioclimatic variables we are working with are from the present climate.
Model evaluation
Extracting evaluation scores
We are now able to retrieve our model evaluations using the get_evaluations function. In this tutorial, we focus on two evaluation methods: the Receiver Operating Characteristic (ROC) curve and the True Skill Statistic (TSS).
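As a reminder, TSS equals sensitivity + specificity − 1, ranging from −1 to 1 (1 being a perfect prediction, 0 no better than random). A minimal sketch, using hypothetical observed and predicted binary vectors:

```r
# Hypothetical binary vectors (1 = presence, 0 = absence)
obs  <- c(1, 1, 1, 0, 0, 0, 1, 0)
pred <- c(1, 1, 0, 0, 0, 1, 1, 0)
tp <- sum(obs == 1 & pred == 1)  # true positives
fn <- sum(obs == 1 & pred == 0)  # false negatives
tn <- sum(obs == 0 & pred == 0)  # true negatives
fp <- sum(obs == 0 & pred == 1)  # false positives
sensitivity <- tp / (tp + fn)    # true positive rate: 0.75
specificity <- tn / (tn + fp)    # true negative rate: 0.75
TSS <- sensitivity + specificity - 1
TSS                              # 0.5
```

Note that biomod2 reports sensitivity and specificity as percentages, while the TSS scores themselves are on the −1 to 1 scale.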
# get all models evaluation
myBiomodModelEval <- get_evaluations(myBiomodModelOut) #contains model evaluation scores
# print the dimnames of this object
dimnames(myBiomodModelEval)
[[1]]
[1] "TSS" "ROC"
[[2]]
[1] "Testing.data" "Cutoff" "Sensitivity" "Specificity"
[[3]]
[1] "GBM" "GLM" "MAXENT.Tsuruoka" "RF"
[[4]]
[1] "RUN1" "RUN2" "RUN3"
[[5]]
Panthera.onca_AllData
"AllData"
hist(myBiomodModelEval)
write.table(myBiomodModelEval, file = "data/results/EvalAll_Model_Eval")
Plotting evaluation scores
We can also plot the evaluation scores for sensitivity and specificity across all models.
# print the Sensitivity of all models --> TSS and ROC scores for SENSITIVITY
mySensitivity_All_Models <- myBiomodModelEval[, "Sensitivity", , , ]
mySensitivity_All_Models
, , RUN1
GBM GLM MAXENT.Tsuruoka RF
TSS 80 40 90 70
ROC 80 40 90 70
, , RUN2
GBM GLM MAXENT.Tsuruoka RF
TSS 80 50 60 70
ROC 80 50 60 70
, , RUN3
GBM GLM MAXENT.Tsuruoka RF
TSS 100 100 90 100
ROC 100 100 90 100
mySensAll <- as.data.frame(t(as.data.frame(mySensitivity_All_Models)))
MySensHist.TSS <- ggplot(mySensAll, aes(x = TSS)) + geom_histogram(aes(y = ..density..),
binwidth = density(mySensAll$TSS)$bw) + geom_density(fill = "red", alpha = 0.2)
MySensHist.ROC <- ggplot(mySensAll, aes(x = ROC)) + geom_histogram(aes(y = ..density..),
binwidth = density(mySensAll$ROC)$bw) + geom_density(fill = "red", alpha = 0.2)
MySensHist.TSS
MySensHist.ROC
# save results
write.table(mySensitivity_All_Models, file = "data/results/Sensitivity_All_Models")
# Proportion of PRESENCES correctly predicted (true positive rate)
sensitivity <- read.table("data/results/Sensitivity_All_Models")
summary(sensitivity)
   GBM.RUN1    GLM.RUN1 MAXENT.Tsuruoka.RUN1    RF.RUN1    GBM.RUN2
Min. :80 Min. :40 Min. :90 Min. :70 Min. :80
1st Qu.:80 1st Qu.:40 1st Qu.:90 1st Qu.:70 1st Qu.:80
Median :80 Median :40 Median :90 Median :70 Median :80
Mean :80 Mean :40 Mean :90 Mean :70 Mean :80
3rd Qu.:80 3rd Qu.:40 3rd Qu.:90 3rd Qu.:70 3rd Qu.:80
Max. :80 Max. :40 Max. :90 Max. :70 Max. :80
GLM.RUN2 MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3
Min. :50 Min. :60 Min. :70 Min. :100
1st Qu.:50 1st Qu.:60 1st Qu.:70 1st Qu.:100
Median :50 Median :60 Median :70 Median :100
Mean :50 Mean :60 Mean :70 Mean :100
3rd Qu.:50 3rd Qu.:60 3rd Qu.:70 3rd Qu.:100
Max. :50 Max. :60 Max. :70 Max. :100
GLM.RUN3 MAXENT.Tsuruoka.RUN3 RF.RUN3
Min. :100 Min. :90 Min. :100
1st Qu.:100 1st Qu.:90 1st Qu.:100
Median :100 Median :90 Median :100
Mean :100 Mean :90 Mean :100
3rd Qu.:100 3rd Qu.:90 3rd Qu.:100
Max. :100 Max. :90 Max. :100
str(sensitivity)
'data.frame': 2 obs. of 12 variables:
$ GBM.RUN1 : int 80 80
$ GLM.RUN1 : int 40 40
$ MAXENT.Tsuruoka.RUN1: int 90 90
$ RF.RUN1 : int 70 70
$ GBM.RUN2 : int 80 80
$ GLM.RUN2 : int 50 50
$ MAXENT.Tsuruoka.RUN2: int 60 60
$ RF.RUN2 : int 70 70
$ GBM.RUN3 : int 100 100
$ GLM.RUN3 : int 100 100
$ MAXENT.Tsuruoka.RUN3: int 90 90
$ RF.RUN3 : int 100 100
sensitivity # table of sensitivity by TSS or ROC per run per model
    GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS 80 40 90 70 80 50
ROC 80 40 90 70 80 50
MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS 60 70 100 100 90
ROC 60 70 100 100 90
RF.RUN3
TSS 100
ROC 100
The plots below use a data frame sensData that must first be built from the sensitivity table (mirroring the specificity step later in the tutorial):
sensTSS <- sensitivity[1, ]
sensTran <- t(sensTSS)
modelNames <- rownames(sensTran)
model1 <- substr(modelNames, 1, 3)
sensPlot1 <- data.frame(sensTran, model1)
sensROC <- sensitivity[2, ]
sensTran <- t(sensROC)
modelNames <- rownames(sensTran)
model2 <- substr(modelNames, 1, 3)
sensPlot2 <- data.frame(sensTran, model2)
sensData <- cbind(sensPlot1, sensPlot2)
sensTSS.plot <- ggplot(sensData, aes(x = model1, y = TSS, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using TSS scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "TSS (%)")
sensTSS.plot <- ggplotly(sensTSS.plot)
sensROC.plot <- ggplot(sensData, aes(x = model2, y = ROC, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using ROC scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "ROC (%)")
sensROC.plot <- ggplotly(sensROC.plot)
sensTSS.plot
sensROC.plot
Discussion: How do the models compare by their sensitivity scores?
##### plot SPECIFICITY #####
# Proportion of ABSENCES correctly predicted (true negative rate)
specificity <- read.table("data/results/Specificity_All_Models")
summary(specificity) GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1
Min. :100 Min. :100 Min. :100 Min. :100
1st Qu.:100 1st Qu.:100 1st Qu.:100 1st Qu.:100
Median :100 Median :100 Median :100 Median :100
Mean :100 Mean :100 Mean :100 Mean :100
3rd Qu.:100 3rd Qu.:100 3rd Qu.:100 3rd Qu.:100
Max. :100 Max. :100 Max. :100 Max. :100
GBM.RUN2 GLM.RUN2 MAXENT.Tsuruoka.RUN2 RF.RUN2
Min. :66.67 Min. :100 Min. :100 Min. :66.67
1st Qu.:66.67 1st Qu.:100 1st Qu.:100 1st Qu.:66.67
Median :66.67 Median :100 Median :100 Median :66.67
Mean :66.67 Mean :100 Mean :100 Mean :66.67
3rd Qu.:66.67 3rd Qu.:100 3rd Qu.:100 3rd Qu.:66.67
Max. :66.67 Max. :100 Max. :100 Max. :66.67
GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3 RF.RUN3
Min. :66.67 Min. :66.67 Min. :66.67 Min. :66.67
1st Qu.:66.67 1st Qu.:66.67 1st Qu.:66.67 1st Qu.:66.67
Median :66.67 Median :66.67 Median :66.67 Median :66.67
Mean :66.67 Mean :66.67 Mean :66.67 Mean :66.67
3rd Qu.:66.67 3rd Qu.:66.67 3rd Qu.:66.67 3rd Qu.:66.67
Max. :66.67 Max. :66.67 Max. :66.67 Max. :66.67
str(specificity)
'data.frame': 2 obs. of 12 variables:
$ GBM.RUN1 : int 100 100
$ GLM.RUN1 : int 100 100
$ MAXENT.Tsuruoka.RUN1: int 100 100
$ RF.RUN1 : int 100 100
$ GBM.RUN2 : num 66.7 66.7
$ GLM.RUN2 : int 100 100
$ MAXENT.Tsuruoka.RUN2: int 100 100
$ RF.RUN2 : num 66.7 66.7
$ GBM.RUN3 : num 66.7 66.7
$ GLM.RUN3 : num 66.7 66.7
$ MAXENT.Tsuruoka.RUN3: num 66.7 66.7
$ RF.RUN3 : num 66.7 66.7
specificity # table of specificity by TSS or ROC per run per model
    GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS 100 100 100 100 66.667 100
ROC 100 100 100 100 66.667 100
MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS 100 66.667 66.667 66.667 66.667
ROC 100 66.667 66.667 66.667 66.667
RF.RUN3
TSS 66.667
ROC 66.667
specTSS <- specificity[1, ]
specTran <- t(specTSS)
modelNames <- rownames(specTran)
model1 <- substr(modelNames, 1, 3)
specPlot1 <- data.frame(specTran, model1)
specROC <- specificity[2, ]
specTran <- t(specROC)
modelNames <- rownames(specTran)
model2 <- substr(modelNames, 1, 3)
specPlot2 <- data.frame(specTran, model2)
specData <- cbind(specPlot1, specPlot2)
Let's take a look at our plots for specificity:
specTSS.plot <- ggplot(specData, aes(x = model1, y = TSS, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using TSS scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "TSS (%)")
specROC.plot <- ggplot(specData, aes(x = model2, y = ROC, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using ROC scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "ROC (%)")
# Make them interactive!
specTSS.plot <- ggplotly(specTSS.plot)
specROC.plot <- ggplotly(specROC.plot)
# Plot them
specTSS.plot
specROC.plot
We can also print the variable importance for all models:
# print variable importances
get_variables_importance(myBiomodModelOut)
### save models evaluation scores and variables importance on hard drive
capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_eval.txt", sep = "")))
capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName,
    paste(myRespName, "_formal_models_var_imp.txt", sep = "")))
Ensemble forecasting
One powerful feature of biomod2 comes with BIOMOD_EnsembleModeling, which combines individual models to build an ensemble forecast. In the following example, we exclude all models having a TSS score lower than 0.7.
### Building ensemble-models
myBiomodEM <- BIOMOD_EnsembleModeling(
# a "BIOMOD.models.out" returned by BIOMOD_Modeling
modeling.output = myBiomodModelOut,
# a character vector (either 'all' or a sub-selection of model names) that defines the models kept for building the ensemble model
chosen.models = 'all',
# The value chosen for this parameter will control the number of ensemble models built.
# 'all' means a total consensus model will be built
em.by='all',
# Evaluation metric used to build ensemble models
eval.metric = 'all',
# If not NULL, the minimum scores below which models will be excluded of the ensemble-models building
eval.metric.quality.threshold = c(0.7),
  # Estimate the mean probabilities across predictions
prob.mean = T,
prob.cv = T,
prob.ci = T,
prob.ci.alpha = 0.05,
prob.median = T,
committee.averaging = T,
prob.mean.weight = T,
  prob.mean.weight.decay = 'proportional')
You can easily access the data and outputs of myBiomodEM using specific helper functions that make your life easier.
Projecting your outputs
### Make projections on current variable
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl,
xy.new.env = myRespCoord, proj.name = "current", selected.models = "all",
binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")
myCurrentProj <- get_predictions(myBiomodProj)
myCurrentProj
### Make ensemble-models projections on current variable
myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj, binary.meth = "TSS",
total.consensus = TRUE, EM.output = myBiomodEM)
plot(myBiomodEF)
Multi-species modelling
Now that we know how to do single-species modelling, what changes if we want to model the distributions of n species?
biomod2 has no function for modelling multiple species at once, but we can use other R tools to deal with this. For example, we can create a for loop that executes the modelling steps for each species in our species data frame. See below, but do not execute it yet!
# define the species of interest
sp.names <- names(DataSpecies)[1:5]
for (sp.n in sp.names) {
myRespName = sp.n
cat("\n", myRespName, "modelling...")
### definition of data i.e keep only the column of our species
myResp <- as.numeric(DataSpecies[, myRespName])
myRespCoord = myRespCoord
### Initialisation
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl,
resp.xy = myRespCoord, resp.name = myRespName)
### Options definition
myBiomodOption <- BIOMOD_ModelingOptions()
### Modelling
myBiomodModelOut <- BIOMOD_Modeling(myBiomodData, models = c("GBM", "GLM",
"MAXENT.Tsuruoka", "RF"), models.options = myBiomodOption, NbRunEval = 3,
DataSplit = 80, Prevalence = 0.5, VarImport = 3, models.eval.meth = c("TSS",
"ROC"), SaveObj = TRUE, rescal.all.models = FALSE, do.full.models = FALSE,
modeling.id = paste(myRespName, "CurrentClim", sep = ""))
### save models evaluation scores and variables importance on hard drive
capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_eval.txt", sep = "")))
capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_var_imp.txt", sep = "")))
### Building ensemble-models
myBiomodEM <- BIOMOD_EnsembleModeling(modeling.output = myBiomodModelOut,
chosen.models = "all", em.by = "all", eval.metric = "all", eval.metric.quality.threshold = c(0.7),
prob.mean = T, prob.cv = T, prob.ci = T, prob.ci.alpha = 0.05, prob.median = T,
committee.averaging = T, prob.mean.weight = T, prob.mean.weight.decay = "proportional")
### Make projections on current variable
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl,
xy.new.env = myRespCoord, proj.name = "current", selected.models = "all",
binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")
### Make ensemble-models projections on current variable
myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj,
binary.meth = "TSS", total.consensus = TRUE, EM.output = myBiomodEM)
}
DIY: Create your own SDM!
### Load the first species' binary maps, which will define the mask
alphaMap <- reclassify(subset(myExpl, 1), c(-Inf, Inf, 0))
alphaMap <- get(load(paste(getwd(), "/", sp.names[1], "/proj_current/", "proj_current_",
    sp.names[1], "_ensemble_TSSbin.RData", sep = "")))[, 1]